The assignment for the Bioinformatician (Pipeline Development) position at SOPHiA GENETICS was composed of three main parts: read coverage, variant calling and analytical performance. The assignment is based on a targeted capture sequencing experiment in which paired-end reads were generated for one sample at a selected region of chromosome 19 of the human genome (hg19).
In this report, I first provide a brief background on capture sequencing technology with respect to the analysis of human diseases. I then describe the bioinformatics workflow that I implemented specifically for this assignment and, finally, I lay out the results together with my interpretation. The pipeline and the scripts have been uploaded to GitHub at this link

Background

Whole-exome sequencing analyzes the exons, or coding regions, of thousands of genes simultaneously using next-generation sequencing techniques. By sequencing the exome of a patient and comparing it with a normal reference sequence, variations in an individual's DNA sequence can be identified and related back to the individual's medical concerns in an effort to discover the cause of the medical disorder.

Methods

Results and Discussion

(1) Read coverage

In Next Generation Sequencing (NGS) studies, the term "coverage" describes the number of times that a nucleotide or base in the reference is covered by a high-quality aligned read. A reference can be a whole genome, a specific chromosome or targeted genomic regions, and the number of reads that align to, or "cover", a base is known as the redundancy, depth or depth of coverage (Sims et al. 2014). Coverage has also been used to denote the breadth of coverage of a target genome, defined as the percentage of target bases that are sequenced a given number of times. For example, a genome sequencing study may sequence a genome to 30X average depth and achieve a 95% breadth of coverage of the reference genome at a minimum depth of ten reads.
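As a minimal sketch, the two statistics can be computed from a per-base depth table in the three-column (chromosome, position, depth) format that `samtools depth` emits; the values below are invented purely for illustration:

```shell
# Toy per-base depth table (chrom, position, depth); values are made up.
cat > /tmp/toy_depth.txt <<'EOF'
19 100 12
19 101 15
19 102 9
19 103 30
19 104 2
EOF

# Mean depth, and breadth of coverage at a minimum depth of 10 reads
awk '{ sum += $3; if ($3 >= 10) covered++ }
     END { printf "mean=%.1f breadth_ge10=%.0f%%\n", sum/NR, 100*covered/NR }' /tmp/toy_depth.txt
# -> mean=13.6 breadth_ge10=60%
```

The same one-liner applied to a genome-wide depth file gives the average depth and breadth figures quoted throughout this section.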
In this assignment, the reference was a region of chromosome 19 of the human genome hg19 and the first question was:
What can you say about the quality of the read coverage in the sample?

First, to reduce biased coverage estimates, I identified and removed read pairs that were likely to have originated as duplicates of the same original DNA fragment through artifactual processes during sample preparation (e.g., library construction using PCR). For the given sample, ~25% of the aligned reads were estimated to be duplicates and were removed using Picard MarkDuplicates (Broad Institute 2018). The removal of duplicate reads is expected to reduce non-uniformity of coverage and thus increase the power to detect variants (Kozarewa et al. 2009). The data should also be corrected for patterns of systematic errors in the base quality scores, which are confidence scores emitted by the sequencer for each base. Base quality scores play an important role in weighing the evidence for or against possible variant alleles during the variant discovery process, so it is important to correct any systematic bias observed in the data. However, because part 3 of the assignment asks for a comparison between variant calls (i.e., ground truth vs. variants called by the pipeline), I performed the base quality recalibration step at a later stage of the analysis. I integrated the statistics of analytical performance (sensitivity, precision and specificity) with information from the recalibration step to give an additional view of potential reasons for the differences between the two variant call sets.
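The core idea behind duplicate marking can be sketched as follows: reads (here reduced to chrom/start/end triples, with made-up coordinates) that share identical alignment coordinates are treated as PCR duplicates and only one representative is kept. This is a deliberate simplification; Picard MarkDuplicates additionally considers strand, mate position and base-quality sums when choosing the representative read.

```shell
# Toy read list (chrom, start, end); the first three lines are coordinate
# duplicates of each other, so only one of them should survive.
cat > /tmp/toy_reads.txt <<'EOF'
19 60010 60110
19 60010 60110
19 60010 60110
19 61500 61600
EOF

# Keep the first read seen at each (chrom, start, end) coordinate
awk '!seen[$1"\t"$2"\t"$3]++' /tmp/toy_reads.txt
# -> 2 of 4 lines survive (50% duplicates in this toy example)
```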
After filtering out duplicate reads, I have calculated both the redundancy and breadth of coverage for high-quality bases (\(\geq\)Q20) from high-quality alignments (\(\geq\)MapQ30), within the region of chromosome 19 of the human genome hg19 using SAMtools (Li et al. 2009) and deepTools (Ramı́rez et al. 2016).

# Extract chromosome name, first and last position of the aligned read from the BAM file without duplicates
samtools view ../Files_needed_for_task/chr19_dedup_sort.bam | awk '{print $3 "\t" $4 "\t" $4+length($10)-1}' > results/1_read_coverage/chr19_interval.txt

# Print on the screen the first and the last position of the chr19 region
# (sort by start for the minimum, by end for the maximum)
sort -nk2 results/1_read_coverage/chr19_interval.txt | awk 'NR==1{print "min:"$2}'
sort -nk3 results/1_read_coverage/chr19_interval.txt | awk 'END{print "max:"$3}'

The breadth of coverage or the percentage of the chr19 genomic region that was covered by, for example, a minimum of ten reads was 15%. Thus, the selected interval of chr19 was overall poorly covered (low breadth of coverage) and on average, there were 13 reads per base pair, meaning an average of 13X coverage (Fig. 1 top panel). Even after the removal of duplicate reads, the overall coverage was still non-uniform suggesting that this portion of chr19 is prone to coverage fluctuations due to biases connected with sample preparation, sequencing, genomic alignment and/or assembly.

For single sample sequencing, which is the case of this assignment, an average of 30-35X coverage has been the standard minimum for generating a reliable variant call (Ajay et al. 2011; Sims et al. 2014). Given that the overall fraction of the chr19 region was scarcely covered and that the average coverage was much lower than the current standard, the variant calling output for the given sample should be treated with caution. I expect the variant calling output to contain variants that are not present in the ground truth (false positives) as well as to lack variants that are instead present in the ground truth (false negatives). I expect, of course, to find true positives and true negatives using this pipeline but it will be less sensitive, less precise and less specific at calling variants than the pipeline used for generating the ground truth.
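For reference, the three performance statistics mentioned above are defined in terms of true/false positive and negative calls as:

\[
\text{sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{specificity} = \frac{TN}{TN + FP}
\]

False positives lower precision and specificity, while false negatives lower sensitivity, which is why low and uneven coverage is expected to degrade all three.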

# Estimate per-base depth for high-quality bases (base quality >= 20) from high-quality alignments (mapping quality >= 30)
samtools depth -a -q 20 -Q 30 -r 19:60004-14992498 ../Files_needed_for_task/chr19_dedup_sort.bam > results/1_read_coverage/chr19_depth.txt

cut -f 3 results/1_read_coverage/chr19_depth.txt | sort | uniq -c > results/1_read_coverage/chr19_cov_redundancy.txt

# Mean coverage
awk '{ sum += $3 } END { if (NR > 0) print sum / NR }' results/1_read_coverage/chr19_depth.txt
library(ggplot2)
library(plotly)
library(purrr)
library(dplyr)
library(patchwork)

# Read coverage per base data (output DeepTools plotCoverage)
deepT <- read.table("results/1_read_coverage/chr19_coverage.txt")
deepB <- read.table("results/1_read_coverage/chr19_coverage_target.txt")

# Prepare data for figures
deepT <- merge(x = deepT, y = deepB, by = c("V1", "V2", "V3"))
# "xxx" holds the overall (whole-region) coverage; "intarget"/"outarget"
# the coverage inside and outside the capture target
colnames(deepT) <- c("chr", "start", "end", "xxx", "intarget", "outarget")
# Count number of bases with the same depth
deepC <- apply(X = deepT[, -1:-3], MARGIN = 2, FUN = function(y) {
   aggregate(x = deepT$chr, by = list(y), length)
})

# Merge list of dataframes
ad <- deepC %>% reduce(left_join, by = "Group.1")
colnames(ad) <- c("depth", "xxx", "intarget", "outarget") 
# Replace NAs with 0
ad$outarget <- ifelse(test = is.na(ad$outarget), yes = 0, no = ad$outarget)

# Calculate "reverse cumulative sum"
calc_rev <- function(dat, coln, dep_val) {
   orow <- dat[dat[, 1] %in% dep_val, ]
   for (i in 1:nrow(orow)) {
      if (as.numeric(orow[i, 1])==0) {
         orow[i, paste(coln, "perc", sep = "_")] <- (orow[i, coln]/sum(dat[, coln])) * 100
      } else {
         orow[i, paste(coln, "perc", sep = "_")] <- (sum(dat[dat[,1]>=as.numeric(orow[i, 1]), coln])/sum(dat[, coln]))*100
      }
   }
   return(orow[, paste(coln, "perc", sep = "_")])
}
ad$xxx_perc <- calc_rev(dat = ad, coln = "xxx", dep_val = ad$depth)
ad$intarget_perc <- calc_rev(dat = ad, coln = "intarget", dep_val = ad$depth)
ad$outarget_perc <- calc_rev(dat = ad, coln = "outarget", dep_val = ad$depth)

# Select values of read coverage for the figure
brks <- c(0:10, 15, seq(20,100,length.out = 9), 125, seq(150, 350,by = 50))
red_bin <- ad[ad$depth %in% brks, 1:4]

# Calculate fraction of bases covered
red_bin$xxx_prop <- red_bin$xxx/sum(red_bin$xxx)
red_bin$in_prop <- red_bin$intarget/sum(red_bin$intarget)
red_bin$out_prop <- red_bin$outarget/sum(red_bin$outarget)

dtn <- data.frame(depth=rep(red_bin$depth,3),
                  prop=c(red_bin$xxx_prop, red_bin$in_prop, red_bin$out_prop),
                  pal=c(rep("xxx", nrow(red_bin)),
                        rep("intarget", nrow(red_bin)),
                        rep("outarget", nrow(red_bin))))
# Plot fraction of bases against read coverage without the zero coverage
ggplot(data = dtn[dtn$depth != 0, ]) +
   geom_col(aes(x = as.factor(depth), y = prop, fill = pal),
            position = "dodge") +
   theme_bw() +
   scale_fill_manual(values = c("#ff7f0e", "#2ca02c", "#1f77b4")) +
   labs(x = "Coverage (# reads per bp)", y = "Fraction of bases", fill = "") +
   theme(axis.title = element_text(size = 16), axis.text = element_text(size = 9),
         legend.position = "none") +
   annotate("text", y = 0.038, x = 25, label="Overall mean = 13X", hjust=1, vjust=1, size = 6, col = "#1f77b4") +
   annotate("text", y = 0.031, x = 25, label="Target mean = 11X", hjust=1, vjust=1, size = 6, col = "#ff7f0e") +
   annotate("text", y = 0.024, x = 25, label="Off-target mean = 2X", hjust=1, vjust=1, size = 6,
            col = "#2ca02c")

# Plot breadth against depth of coverage
fig <- plot_ly(ad[ad$depth<=200,], x = ~depth)
fig <- fig %>% add_trace(y = ~xxx_perc, name = 'Overall', type = 'scatter', mode = 'lines')
fig <- fig %>% add_trace(y = ~intarget_perc, name = 'Target', type = 'scatter', mode = 'lines')
fig <- fig %>% add_trace(y = ~outarget_perc, name = 'Off-target', type = 'scatter', mode = 'lines')
fig <- fig %>% layout(xaxis = list(title = "Coverage (# reads per bp)", titlefont = list(size = 18)),
                      yaxis = list(title = "Percentage of bases >= coverage", titlefont = list(size = 18)),
                      legend = list(font = list(size = 16)))
fig

Figure 1. Read coverage for the overall region of chromosome 19 and for the target and off-target regions. Top panel: mean coverage for the three categories and the fraction of bases at a selection of the observed read depths (not all depths are shown). Bottom panel (interactive): breadth of coverage, i.e., the fraction of the chr19 region covered by at least a given number of reads.

 

One important point concerns the statistic that is often used to predict the quality of the variant calling analysis. The mean of a continuous variable is most meaningful when the values are normally distributed and the variance is homogeneous. In NGS studies where coverage follows a normal distribution and is homoscedastic (e.g., Prabhu and Pe'er 2009; Yoon et al. 2009), the mean is a valid proxy for depth of coverage and thus for predicting the overall power of variant calling. The same criterion applies in studies where coverage is high and uniform across the reference sequence. However, coverage data is rarely normally distributed or homoscedastic, suggesting that the mean is not an appropriate proxy for depth of coverage and thus may not be strongly predictive of the reliability of the variant calling output.
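The point about skewed coverage can be made concrete with a small sketch: in the made-up depths below, a single high-coverage position pulls the mean well above the typical per-base depth, while the median stays robust.

```shell
# Made-up per-base depths with one extreme outlier (200)
printf '%s\n' 2 3 3 4 4 5 5 6 7 200 > /tmp/toy_skew.txt

# Compare mean and median of the sorted depths
sort -n /tmp/toy_skew.txt | awk '{ v[NR] = $1; sum += $1 }
  END {
    mean = sum / NR
    median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
    printf "mean=%.1f median=%.1f\n", mean, median
  }'
# -> mean=23.9 median=4.5
```

A mean of ~24X here would suggest comfortable coverage, while half the positions in fact sit at 4-5X or below.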

Counting the number of reads overlapping each position tells us how many bases are covered, and how many times.

Part of this non-uniformity is likely due to the fact that many of the target regions exhibit high sequence similarity to other regions of the genome. Reads that should map to the target regions are "robbed" by other regions of similar sequence; combined with target regions of relatively unique sequence, this produces a distorted coverage profile. This is not necessarily a problem, but it motivates setting a high mapping-quality threshold to ensure that reads have a high probability of mapping to the regions where they are expected.
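In practice the mapping-quality filter is applied with `samtools view -q 30`; the sketch below, on toy SAM-formatted lines with invented reads (MAPQ is field 5), shows the effect of the threshold.

```shell
# Toy SAM records: r2 has MAPQ 0, typical of a multi-mapping read
cat > /tmp/toy.sam <<'EOF'
r1 0 19 60010 60 101M * 0 0 ACGT IIII
r2 0 19 60200 0 101M * 0 0 ACGT IIII
r3 0 19 60400 37 101M * 0 0 ACGT IIII
EOF

# Keep only alignments with mapping quality >= 30
awk '$5 >= 30' /tmp/toy.sam
# -> r1 and r3 are kept; r2 (MAPQ 0) is filtered out
```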

 

Figure. GC content.

 

The transition/transversion ratio (Ti/Tv ratio) is the proportion of variants observed as transitions (between purines, or between pyrimidines) versus transversions (between a purine and a pyrimidine). The Ti/Tv ratio is particularly useful for assessing the quality of single nucleotide polymorphisms inferred from sequencing data, and a higher ratio generally indicates higher accuracy (Jiang et al. 2019).
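The ratio is straightforward to compute from a list of substitutions; the sketch below uses an invented list of ref/alt pairs, classifying A<->G and C<->T as transitions and everything else as transversions.

```shell
# Toy SNV list (ref allele, alt allele); values are made up
cat > /tmp/toy_snv.txt <<'EOF'
A G
C T
G A
T C
A C
C G
EOF

# Sort each pair alphabetically so A->G and G->A share the key "AG",
# then count transitions (AG, CT) vs transversions
awk '{
  if ($1 < $2) key = $1 $2; else key = $2 $1
  if (key == "AG" || key == "CT") ti++; else tv++
}
END { printf "Ti=%d Tv=%d Ti/Tv=%.1f\n", ti, tv, ti/tv }' /tmp/toy_snv.txt
# -> Ti=4 Tv=2 Ti/Tv=2.0
```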

(2) Variant calling

SNPs as presented in public databases like dbSNP or HapMap are germline variations for which, at most, population frequencies are known. In the literature it is usually assumed that a variation should be found in more than 1% of the population in order to be called a SNP. Such information is very useful for biomarker development since it describes the prevalence of the mutation in different populations. However, it is normally not possible to obtain additional information (such as gender, age, or disease status) on the individuals carrying the SNP; only the population a person belongs to is given. Since it is not known whether the information comes from a tumor or a normal sample, a correlation between diseases and SNPs cannot be calculated.

(3) Analytical performance

References

Ajay, S. S., S. C. Parker, H. O. Abaan, K. V. Fajardo, and E. H. Margulies. 2011. Accurate and comprehensive sequencing of personal genomes. Genome Research 21:1498–505.
Broad Institute. 2018. Picard Tools. http://broadinstitute.github.io/picard/.
Jiang, Y., Y. Jiang, S. Wang, Q. Zhang, and X. Ding. 2019. Optimal sequencing depth design for whole genome re-sequencing in pigs. BMC Bioinformatics 20:556.
Kozarewa, I., Z. Ning, M. A. Quail, M. J. Sanders, M. Berriman, and D. J. Turner. 2009. Amplification-free illumina sequencing-library preparation facilitates improved mapping and assembly of (g+c)-biased genomes. Nature Methods 6:291–5.
Li, H., B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, and R. Durbin. 2009. The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079.
Prabhu, S., and I. Pe’er. 2009. Overlapping pools for high-throughput targeted resequencing. Genome Research 19:1254–61.
Ramírez, F., D. P. Ryan, B. Grüning, V. Bhardwaj, F. Kilpert, A. S. Richter, S. Heyne, F. Dündar, and T. Manke. 2016. deepTools2: A next generation web server for deep-sequencing data analysis. Nucleic Acids Research 44:W160–W165.
Sims, D., I. Sudbery, N. E. Ilott, A. Heger, and C. P. Ponting. 2014. Sequencing depth and coverage: Key considerations in genomic analyses. Nature Reviews Genetics 15:121–32.
Yoon, S., Z. Xuan, V. Makarov, K. Ye, and J. Sebat. 2009. Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Research 19:1586–92.